These are the projects and labs completed for Data Visualization (STAT 302) - Spring 2020.


Lab 01 ggplot overview

Overview

The goals of this lab are to (1) ensure that the major software for this course is properly installed and functional, (2) develop and follow a proper workflow, and (3) work together to construct a few plots that explore a dataset using ggplot2, as a demonstration of its utility and power.

Don’t worry if you cannot do everything here by yourself. You are just getting started and the learning curve is steep, but remember that the instructional team and your classmates will be there to provide support. Persevere and put forth an honest effort and this course will pay off.

Load Packages: tidyverse, ggstance, skimr


Dataset

We’ll be using data from the lego package which is already in the /data subdirectory, along with many other processed datasets, as part of the zipped folder for this lab.

Exercise 1

Let’s look at some interesting patterns in the history of LEGO! We’ll be using data from the lego package, located at data/legosets.rda. We will work through this exercise together in class.

1a Inspect the data

The lego package provides a helpful dataset with some interesting variables. Let’s take a quick look at the data.

## Rows: 6,172
## Columns: 14
## $ Item_Number  <chr> "10246", "10247", "10248", "10249", "10581", "10582", ...
## $ Name         <chr> "Detective's Office", "Ferris Wheel", "Ferrari F40", "...
## $ Year         <int> 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, ...
## $ Theme        <chr> "Advanced Models", "Advanced Models", "Advanced Models...
## $ Subtheme     <chr> "Modular Buildings", "Fairground", "Vehicles", "Winter...
## $ Pieces       <int> 2262, 2464, 1158, 898, 13, 39, 32, 105, 13, 11, 52, 13...
## $ Minifigures  <int> 6, 10, NA, NA, 1, 2, 2, 3, 2, 2, 3, 1, NA, NA, NA, NA,...
## $ Image_URL    <chr> "http://images.brickset.com/sets/images/10246-1.jpg", ...
## $ GBP_MSRP     <dbl> 132.99, 149.99, 69.99, 59.99, 9.99, 16.99, 19.99, 49.9...
## $ USD_MSRP     <dbl> 159.99, 199.99, 99.99, 79.99, 9.99, 19.99, 24.99, 59.9...
## $ CAD_MSRP     <dbl> 199.99, 229.99, 119.99, NA, 12.99, 24.99, 29.99, 69.99...
## $ EUR_MSRP     <dbl> 149.99, 179.99, 89.99, 69.99, 9.99, 19.99, 24.99, 59.9...
## $ Packaging    <chr> "Box", "Box", "Box", "Box", "Box", "Box", "Box", "Box"...
## $ Availability <chr> "Retail - limited", "Retail - limited", "LEGO exclusiv...
Data summary
Name                    legosets
Number of rows          6172
Number of columns       14
_______________________
Column type frequency:
  character             7
  numeric               7
________________________
Group variables         None

Variable type: character

skim_variable  n_missing  complete_rate  min  max  empty  n_unique  whitespace
Item_Number            0              1    1   13      0      5854           0
Name                   0              1    2   73      0      5519           4
Theme                  0              1    4   28      0       115           0
Subtheme               0              1    0   32   2206       358           1
Image_URL              0              1   46   58      0      6172           0
Packaging              0              1    3   21      0        14           0
Availability           0              1    6   21      0         8           0

Variable type: numeric

skim_variable  n_missing  complete_rate     mean      sd       p0      p25      p50      p75     p100
Year                   0           1.00  2004.71    8.91  1971.00  2000.00  2006.00  2012.00  2015.00
Pieces               112           0.98   215.17  356.20     0.00    30.00    82.00   256.25  5922.00
Minifigures         2672           0.57     2.85    2.72     1.00     1.00     2.00     4.00    32.00
GBP_MSRP            1980           0.68    23.45   31.93     0.00     5.99    12.99    29.99   509.99
USD_MSRP             355           0.94    27.90   39.32     0.00     6.00    14.99    34.99   789.99
CAD_MSRP            4190           0.32    46.34   58.46     2.99    12.99    24.99    54.99   789.99
EUR_MSRP            4399           0.29    35.98   46.61     0.00     9.99    19.99    39.99   699.99

Notice there are a lot of missing values, especially in the price columns. This will be important when we calculate means.
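
As a small illustration of why the missing values matter, the sketch below averages the first four CAD_MSRP values shown in the output above, one of which is NA. The vector name cad_prices is made up for this example.

```r
# First four CAD_MSRP values from the glimpse() output above; the fourth is missing.
cad_prices <- c(199.99, 229.99, 119.99, NA)

# By default, a single NA makes the whole mean NA.
mean_default <- mean(cad_prices)

# na.rm = TRUE drops the missing values before averaging.
mean_no_na <- mean(cad_prices, na.rm = TRUE)

print(mean_default)
print(mean_no_na)
```

The same na.rm = TRUE argument would be needed inside any grouped summary of the price columns.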


1c Pieces per year

Next, let’s look at how the number of pieces per set has changed over time. Because Duplo sets are much smaller (since they’re designed for toddlers), we’ll make a special indicator variable for them.
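
One possible way to build such an indicator is sketched below. It assumes Duplo sets are identified by Theme being exactly "Duplo", which should be checked against the real data. The lego_toy rows are made up for illustration and are not taken from legosets.

```r
library(dplyr)

# Toy stand-in for legosets; these rows are invented for illustration only.
lego_toy <- tibble(
  Year   = c(2014, 2014, 2015, 2015),
  Theme  = c("Duplo", "City", "Duplo", "Advanced Models"),
  Pieces = c(13, 250, 39, 2262)
)

# Indicator variable: TRUE for Duplo sets (assumes the theme is labelled "Duplo").
pieces_per_year <- lego_toy %>%
  mutate(Duplo = Theme == "Duplo") %>%
  group_by(Year, Duplo) %>%
  # na.rm = TRUE because Pieces has missing values in the real data
  summarise(mean_pieces = mean(Pieces, na.rm = TRUE), .groups = "drop")

print(pieces_per_year)
```

The resulting table has one row per year and Duplo status, which can then be plotted with the indicator mapped to color.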

Lab 02 aesthetics

Overview

The goal of this lab is to begin the process of unlocking the power of ggplot2 through constructing and experimenting with a few basic plots.

Datasets

We’ll be using data from the blue_jays.rda dataset which is already in the /data subdirectory in our data_vis_labs project. Below is a description of the variables contained in the dataset.

  • BirdID - ID tag for bird
  • KnownSex - Sex coded as F or M
  • BillDepth - Thickness of the bill measured at the nostril (in mm)
  • BillWidth - Width of the bill (in mm)
  • BillLength - Length of the bill (in mm)
  • Head - Distance from tip of bill to back of head (in mm)
  • Mass - Body mass (in grams)
  • Skull - Distance from base of bill to back of skull (in mm)
  • Sex - Sex coded as 0 = female or 1 = male

We’ll also be using a subset of the BRFSS (Behavioral Risk Factor Surveillance System) survey collected annually by the Centers for Disease Control and Prevention (CDC). The data can be found in the provided cdc.txt file — place this file in your /data subdirectory. The dataset contains 20,000 complete observations/records of 9 variables/fields, described below.

  • genhlth - How would you rate your general health? (excellent, very good, good, fair, poor)
  • exerany - Have you exercised in the past month? (1 = yes, 0 = no)
  • hlthplan - Do you have some form of health coverage? (1 = yes, 0 = no)
  • smoke100 - Have you smoked at least 100 cigarettes in your lifetime? (1 = yes, 0 = no)
  • height - height in inches
  • weight - weight in pounds
  • wtdesire - weight desired in pounds
  • age - in years
  • gender - m for males and f for females

Notice we are setting a seed. This signifies we will be doing something that relies on a random process (e.g., random sampling). In order for our results to be reproducible we set the seed. This ensures that every time you run the code or someone else does, it will produce the exact same output. It is good coding etiquette to set the seed towards the top of your document/code.
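
A minimal base R sketch of what setting the seed does. The seed value 302 and the variable names are arbitrary choices for this example.

```r
# The seed value is arbitrary; any fixed integer works.
set.seed(302)
first_draw <- sample(1:20000, size = 5)

# Resetting the same seed reproduces exactly the same draw.
set.seed(302)
second_draw <- sample(1:20000, size = 5)

# TRUE: both draws contain the same five row numbers in the same order.
print(identical(first_draw, second_draw))
```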

Exercises

Complete the following exercises.


Exercise 1

Using the blue_jays dataset, construct the following scatterplots of Head by Mass:

  1. One with the color aesthetic set to Northwestern purple (#4E2A84), shape aesthetic set to a solid/filled triangle, and size aesthetic set to 2.
  2. One using Sex or KnownSex mapped to the color aesthetic. That is, determine which is more appropriate and explain why. Also set the size aesthetic to 2.


Consider the color aesthetic in the plots for (1) and (2). Explain why these two usages of the color aesthetic are meaningfully different.


##       BirdID KnownSex BillDepth BillWidth BillLength  Head  Mass Skull Sex
## 1 0000-00000        M      8.26      9.21      25.92 56.58 73.30 30.66   1
## 2 1142-05901        M      8.54      8.76      24.99 56.36 75.10 31.38   1
## 3 1142-05905        M      8.39      8.78      26.07 57.32 70.25 31.25   1
## 4 1142-05907        F      7.78      9.30      23.48 53.77 65.50 30.29   0
## 5 1142-05909        M      8.71      9.84      25.47 57.32 74.90 31.85   1
## 6 1142-05911        F      7.28      9.30      22.25 52.25 63.90 30.00   0
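
The difference between setting and mapping an aesthetic can be sketched as follows. This is only an illustration: jays holds the six rows printed above as a stand-in for the full dataset, and KnownSex is used here simply as an example of a discrete variable.

```r
library(ggplot2)

# Six rows of blue_jays as printed above, a stand-in for the full data.
jays <- data.frame(
  KnownSex = c("M", "M", "M", "F", "M", "F"),
  Head = c(56.58, 56.36, 57.32, 53.77, 57.32, 52.25),
  Mass = c(73.30, 75.10, 70.25, 65.50, 74.90, 63.90)
)

# Setting: fixed values go outside aes(); shape 17 is a filled triangle.
p_set <- ggplot(jays, aes(x = Mass, y = Head)) +
  geom_point(color = "#4E2A84", shape = 17, size = 2)

# Mapping: a variable goes inside aes(), so ggplot2 builds a color scale and legend.
p_map <- ggplot(jays, aes(x = Mass, y = Head, color = KnownSex)) +
  geom_point(size = 2)

print(p_set)
print(p_map)
```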

Exercise 1.2


The first plot sets color to a fixed value for a purely stylistic reason (a preference for purple), so the color carries no information about the data. The second plot uses the color aesthetic to split the data into two groups (female and male); here color is part of the mapping of the data to aesthetics.


Exercise 2

Using a random subsample of size 100 from the cdc dataset (code provided below), construct a scatterplot of weight by height. Construct 5 more scatterplots of weight by height that make use of the aesthetic attributes color and shape (maybe size too). You can define both aesthetics at the same time in each plot or one at a time. Just experiment. There should be six plots in total.
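
A sketch of how the subsample and one of the plots might look. The lab's original sampling code is not reproduced here, so this is an assumption: cdc_toy is simulated data with made-up values that only mimics three of the cdc columns, and the seed is arbitrary.

```r
library(dplyr)
library(ggplot2)

# Simulated stand-in for cdc (20,000 rows); the values are made up.
set.seed(302)
cdc_toy <- tibble(
  height = round(rnorm(20000, mean = 67, sd = 4)),
  weight = round(rnorm(20000, mean = 170, sd = 40)),
  gender = sample(c("m", "f"), 20000, replace = TRUE)
)

# Random subsample of 100 rows, drawn without replacement.
cdc_small <- cdc_toy %>% slice_sample(n = 100)

# One of many possible variants: map gender to both color and shape.
p_small <- ggplot(cdc_small, aes(x = height, y = weight)) +
  geom_point(aes(color = gender, shape = gender), size = 2)

print(p_small)
```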


Lab 03 categories & geom_smooth

Overview

The goal of this lab is to continue the process of unlocking the power of ggplot2 through constructing and experimenting with a few basic plots.

Datasets

We’ll be using data from the BA_degrees.rda and dow_jones_industrial.rda datasets which are already in the /data subdirectory in our data_vis_labs project. Below is a description of the variables contained in each dataset.

BA_degrees.rda

  • field - field of study
  • year_str - academic year (e.g. 1970-71)
  • year - closing year of academic year
  • count - number of degrees conferred within a field for the year
  • perc - field’s percentage of degrees conferred for the year

dow_jones_industrial.rda

  • date - date
  • open - Dow Jones Industrial Average at open
  • high - Day’s high for the Dow Jones Industrial Average
  • low - Day’s low for the Dow Jones Industrial Average
  • close - Dow Jones Industrial Average at close
  • volume - number of trades for the day

We’ll also be using a subset of the BRFSS (Behavioral Risk Factor Surveillance System) survey collected annually by the Centers for Disease Control and Prevention (CDC). The data can be found in the provided cdc.txt file — place this file in your /data subdirectory. The dataset contains 20,000 complete observations/records of 9 variables/fields, described below.

  • genhlth - How would you rate your general health? (excellent, very good, good, fair, poor)
  • exerany - Have you exercised in the past month? (1 = yes, 0 = no)
  • hlthplan - Do you have some form of health coverage? (1 = yes, 0 = no)
  • smoke100 - Have you smoked at least 100 cigarettes in your lifetime? (1 = yes, 0 = no)
  • height - height in inches
  • weight - weight in pounds
  • wtdesire - weight desired in pounds
  • age - in years
  • gender - m for males and f for females

Exercises

Exercise 1

The following exercises use the BA_degrees data set.

## # A tibble: 6 x 6
##   field    year_str  year  count  perc mean_perc
##   <fct>    <chr>    <dbl>  <dbl> <dbl>     <dbl>
## 1 Business 1970-71   1971 115396 0.137     0.204
## 2 Business 1975-76   1976 143171 0.155     0.204
## 3 Business 1980-81   1981 200521 0.214     0.204
## 4 Business 1985-86   1986 236700 0.240     0.204
## 5 Business 1990-91   1991 249165 0.228     0.204
## 6 Business 1995-96   1996 226623 0.195     0.204


Exercise 2

The following exercises use the dow_jones_industrial dataset.

## # A tibble: 6 x 6
##   date        open  high   low close    volume
##   <date>     <dbl> <dbl> <dbl> <dbl>     <int>
## 1 2008-12-31 8666. 8843. 8665. 8776. 226760000
## 2 2009-01-02 8772. 9065. 8761. 9035. 213700000
## 3 2009-01-05 9027. 9034. 8892. 8953. 233760000
## 4 2009-01-06 8955. 9088. 8941. 9015. 215410000
## 5 2009-01-07 8997. 8997. 8720. 8770. 266710000
## 6 2009-01-08 8770. 8770. 8651. 8742. 226620000


Exercise 3

The following exercises use the cdc dataset.

## # A tibble: 6 x 9
##   genhlth   exerany hlthplan smoke100 height weight wtdesire   age gender
##   <fct>       <dbl>    <dbl>    <dbl>  <dbl>  <dbl>    <dbl> <dbl> <chr> 
## 1 good            0        1        0     70    175      175    77 m     
## 2 good            0        1        1     64    125      115    33 f     
## 3 good            1        1        1     60    105      105    49 f     
## 4 good            1        1        0     66    132      124    42 f     
## 5 very good       0        1        0     61    150      130    55 f     
## 6 very good       1        1        0     64    114      114    55 f

Lab 04 geom_text & annotation

Overview

The goal of this lab is to continue the process of unlocking the power of ggplot2 through constructing and experimenting with a few basic plots.

Load Packages: tidyverse, gridExtra, ggrepel

Exercises

Exercise 1

The following plot uses the blue_jays.rda dataset.

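#load packages and data (blue_jays.rda is in the data/ subdirectory)
library(tidyverse)
load("data/blue_jays.rda")

#create caption that automatically grabs number of blue jays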
caption <- paste("Head length versus body mass for", nrow(blue_jays), "blue jays")

#add string wrap 
#will break the caption into multiple lines if longer than 40 characters
# '\n' is the line break code 
caption_print <- paste(strwrap(caption, 40), collapse ="\n") 

#create data set for top head size for each sex
topHead <- blue_jays %>% 
  #arrange largest to smallest
  arrange(desc(Head)) %>%
  # group by sex
  group_by(KnownSex) %>% 
  #take the top 2 head sizes  for each group 
  top_n(n = 2, wt = Head)

#'M' label will be put on the top male head size 
#'F' label will be put on the 2nd top female head size 
Labels <- topHead[c(1,4),]

#ANOTHER OPTION: label data frame
#search by BirdID
Labels_anotheroption <- blue_jays  %>% 
  #select specific bird where you want the labels 
  filter(BirdID %in% c("1142-05914", "702-90567"))

#get range for x and y variables 
xrng <- range(blue_jays$Mass)
yrng <- range(blue_jays$Head)

#head length by body mass 
ggplot(blue_jays, aes(Mass, Head, color = KnownSex)) +
  geom_point(alpha = 0.6, size = 2) +
  annotate(
     "text"
    #put text in the top left corner of plot 
    , x = xrng[1], y = yrng[2]
    #label is the wrapped caption created above
    , label = caption_print
    #left justify 
    , hjust = 0
    #top justify (vjust = 1 aligns the top of the text with y)
    , vjust = 1
    #size the font
    , size = 4
    ) +
  xlab("Body mass (g)") +
  ylab("Head length (mm)") +
  #remove all legends; to remove just one legend put show.legend = FALSE into geom
  theme(legend.position = "none") +
  # add labels 
  geom_text(
    #use labels data set 
      data = Labels
    , aes(label = KnownSex)
    #nudge labels to the right 
    , nudge_x = 0.5
    )


Exercise 2

The following plots use the tech_stocks dataset.

Exercise 4

The next plot uses the cdc dataset.

Using Bilbo Baggins’ responses below to the CDC BRFSS questions, add Bilbo’s data point as a transparent (0.5) solid red circle of size 4 to a scatterplot of weight by height with transparent (0.1) solid blue circles of size 2 as the plotting characters. In addition, label the point with his name in red. Left justify and rotate the label so it reads vertically from bottom to top, and shift it up by 10 pounds too. The plot should use appropriately formatted axis labels. Remember that the default shape is a solid circle.

  • genhlth - How would you rate your general health? fair
  • exerany - Have you exercised in the past month? 1=yes
  • hlthplan - Do you have some form of health coverage? 0=no
  • smoke100 - Have you smoked at least 100 cigarettes in your lifetime? 1=yes
  • height - height in inches: 46
  • weight - weight in pounds: 120
  • wtdesire - weight desired in pounds: 120
  • age - in years: 45
  • gender - m for males and f for females: m


Hint: Create a new dataset (maybe call it bilbo or bilbo_baggins) using either data.frame() (base R - example in book) or tibble() (tidyverse - see help documentation for the function). Make sure to use variable names that exactly match cdc’s variable names.
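
One way the hint could be followed is sketched below, as an illustration and not the official solution. cdc_head holds only the height and weight values of the six cdc rows printed in Lab 03, as a stand-in for the full dataset; the axis label wording is an assumption.

```r
library(ggplot2)
library(tibble)

# Height and weight of the six cdc rows printed in Lab 03 (stand-in for all 20,000 rows).
cdc_head <- tibble(
  height = c(70, 64, 60, 66, 61, 64),
  weight = c(175, 125, 105, 132, 150, 114)
)

# Bilbo's responses, with names matching cdc's variable names exactly.
bilbo <- tibble(
  genhlth = "fair", exerany = 1, hlthplan = 0, smoke100 = 1,
  height = 46, weight = 120, wtdesire = 120, age = 45, gender = "m"
)

p_bilbo <- ggplot(cdc_head, aes(x = height, y = weight)) +
  # survey respondents: transparent blue circles
  geom_point(color = "blue", alpha = 0.1, size = 2) +
  # Bilbo's point comes from its own data frame
  geom_point(data = bilbo, color = "red", alpha = 0.5, size = 4) +
  # angle = 90 reads bottom to top; hjust = 0 starts the text at the point;
  # nudge_y = 10 shifts the label up by 10 pounds
  geom_text(
    data = bilbo, label = "Bilbo Baggins", color = "red",
    angle = 90, hjust = 0, nudge_y = 10
  ) +
  labs(x = "Height (in)", y = "Weight (lb)")

print(p_bilbo)
```

Because bilbo shares the height and weight column names with cdc, the extra layers inherit the x and y mappings from ggplot() without restating them.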

Lab 05 2d surface & geospatial

Overview

The goal of this lab is to explore more useful plots in ggplot2. Specifically we will be focusing on surface plots and geospatial plots (maps).

Challenges are not mandatory for students to complete. We highly recommend students attempt them though. We would expect graduate students to attempt the challenges.

Exercises

Complete the following exercises.


Challenge(s)

The following plots use the tidycensus package and a few others, and follow these directions.

Try using a different geographical area and a different variable from the ACS.

Plot 1: Manhattan and Household Median Income

Plot 2: Twin Cities and Percentage of Renter-occupied Units

Lab 06 error bars & layers

Overview

The goal of this lab is to explore more plots in ggplot2. Specifically we will be focusing on error bars for uncertainty and practice using multiple layers.

Exercises